STAT 341/641: Intro to EDA and Statistical Computing
Lab #1: Ggplot
Teaching Assistant: “Yanjun Liu”
Directions: The following contains tasks you must complete to receive full credit for this homework. Consult the R Markdown cheatsheet on Canvas if you have questions about markdown syntax.
#Task One: Chapter 3 of R for Data Science
Navigate to https://r4ds.had.co.nz/. You will work through sections 3.1 to 3.10 in this laboratory. Replicate each computation performed in the chapter and answer the associated questions.
##3.1: Introduction.
Solution: (Write your code in the following block. You can add additional blocks in order to write text between the blocks.)
r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
options(repos = r)
install.packages("tidyverse")
## Warning: unable to access index for repository http://cran.us.r-project.org/src/contrib:
## cannot open URL 'http://cran.us.r-project.org/src/contrib/PACKAGES'
## Warning: package 'tidyverse' is not available (for R version 3.5.1)
## Warning: unable to access index for repository http://cran.us.r-project.org/bin/macosx/el-capitan/contrib/3.5:
## cannot open URL 'http://cran.us.r-project.org/bin/macosx/el-capitan/contrib/3.5/PACKAGES'
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.5.2
## Warning: package 'tibble' was built under R version 3.5.2
## Warning: package 'tidyr' was built under R version 3.5.2
## Warning: package 'purrr' was built under R version 3.5.2
## Warning: package 'dplyr' was built under R version 3.5.2
## Warning: package 'stringr' was built under R version 3.5.2
## Warning: package 'forcats' was built under R version 3.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
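The warnings above show that install.packages() could not reach the http:// mirror, while library(tidyverse) still worked because the package was already installed. A minimal sketch of a more defensive setup follows. It assumes the HTTPS mirror https://cloud.r-project.org is reachable, and missing_packages() is a hypothetical helper written here, not part of any package:

```r
# Hypothetical helper: return the packages in `pkgs` that cannot be loaded.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Assumption: this HTTPS CRAN mirror is reachable from your machine.
options(repos = c(CRAN = "https://cloud.r-project.org"))

to_install <- missing_packages(c("tidyverse", "maps"))
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install the missing packages
}
```

Checking first avoids reinstalling packages every time the document is knitted.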
##3.2: First steps
Solution:
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manual… f 21 29 p comp…
## 3 audi a4 2 2008 4 manual… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto(a… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto(l… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manual… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto(a… f 18 27 p comp…
## 8 audi a4 quat… 1.8 1999 4 manual… 4 18 26 p comp…
## 9 audi a4 quat… 1.8 1999 4 auto(l… 4 16 25 p comp…
## 10 audi a4 quat… 2 2008 4 manual… 4 20 28 p comp…
## # … with 224 more rows
?mpg
ggplot(data = mpg) + geom_point(mapping = aes(x = displ,y = hwy))
ggplot(data = mpg) + geom_point(mapping = aes(x = drv,y = class))
# Exercises ***3.2.4***
# Question 1) ggplot(data = mpg) draws an empty panel; no data appear because no geom layer has been added
# Question 2) 234 Rows and 11 columns
# Question 3) drv describes the type of drive train: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive
# Question 4)
ggplot(data = mpg) + geom_point(mapping = aes(x = cyl,y = hwy))
# Question 5) The plot is not useful because it compares two categorical variables: many cars fall on the same few points, so the plot does not show how many observations each combination contains
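The answers to Questions 2 and 5 can be checked directly from the data. This sketch assumes only that ggplot2 is loaded (it provides mpg). The cross-tabulation is one possible alternative to the drv versus class scatterplot, not the book's solution:

```r
library(ggplot2)  # provides the mpg data set

dim(mpg)  # number of rows, then number of columns

# How many cars fall in each class/drv pair
drv_class <- table(mpg$class, mpg$drv)
drv_class
```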
##3.3: Aesthetic mappings
Solution:
# Mapping class to color
ggplot(data = mpg) + geom_point(mapping = aes(x = displ,y = hwy,color = class))
#Class to Size
ggplot(data = mpg) + geom_point(mapping = aes(x = displ,y = hwy,size = class ))
## Warning: Using size for a discrete variable is not advised.
#Class to alpha aesthetic
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy,alpha = class))
## Warning: Using alpha for a discrete variable is not advised.
#Class to Shape
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy,shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
# ***Exercises 3.3.1***
# Question 1) Because color = "blue" sits inside aes(), it is treated as a mapping to a constant variable with the value "blue" rather than as a color. To make the points blue, place color = "blue" outside the aes() parentheses
# Question 2) Continuous: displ, year, cyl, cty, hwy; Categorical: manufacturer, model, trans, drv, fl, class
# Question 3)
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = cyl)) # A continuous variable gets a gradient of a single color (dark to light blue) instead of distinct colors
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y= hwy, size=cyl))
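# Question 4) Mapping a single variable to multiple aesthetics works, but it is redundant and generally bad practice
# Question 5) Stroke changes the width of the border of a shape (for example, shapes 21 to 25, which have a separate fill and border)
# Question 6) This works: ggplot evaluates the expression and colors the points by whether displ < 5 is TRUE or FALSE: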
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y= hwy, colour= displ < 5))
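For Question 2, the column types give a quick way to split the variables. A small sketch, assuming that the character columns are the categorical ones in mpg (a simplification, since integer columns such as cyl or year could also be treated as categorical):

```r
library(ggplot2)

# TRUE for every column stored as character
is_chr <- vapply(mpg, is.character, logical(1))
categorical <- names(mpg)[is_chr]
continuous <- names(mpg)[!is_chr]
categorical
continuous
```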
##3.4: Common problems
Solution:
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
##3.5: Facets
Solution:
# Playing around with facets
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_wrap(~ class, nrow = 2)
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ cyl)
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(. ~ cyl)
# *** Exercises 3.5.1 ***
# Question 1) The continuous variable is converted to a categorical variable, and the plot gets one facet for each distinct value
# Question 2) The empty cells in this plot are combinations of drv and cyl that have no observations
# Question 3) The symbol '.' means no faceting on that dimension: drv ~ . facets by drv in rows only, and . ~ cyl facets by cyl in columns only
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
# Question 4) Advantages: faceting can show more distinct categories than a color scale (which is hard to read beyond roughly 9 colors), and it handles overlapping points better.
# Disadvantages: it is harder to compare values across categories because the points sit in separate panels.
# Question 5) nrow = number of rows; ncol = number of columns
# Question 6) Put the variable with more unique levels in the columns: plots are usually wider than they are tall, so there is more space for columns
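The empty cells from Question 2 can be confirmed by counting the observations in each drv and cyl combination. In this sketch, zero entries in the table correspond to empty panels in facet_grid(drv ~ cyl):

```r
library(ggplot2)

combos <- table(drv = mpg$drv, cyl = mpg$cyl)
combos

# Combinations with no cars, which appear as empty panels
which(combos == 0, arr.ind = TRUE)
```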
##3.6: Geometric objects
Solution:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy)) # Displaying a smooth line plot
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv)) #Adding linetype
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv)) # Group
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg) + geom_smooth( mapping = aes(x = displ, y = hwy, color = drv),
show.legend = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# Plotting and manipulating geom point + geom smooth
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# ***Exercises 3.6.1***
# Question 1 ) line chart: geom_line()
# boxplot: geom_boxplot()
# histogram: geom_histogram()
# area chart: geom_area()
# Question 2) This code produces a scatterplot with displ on the x-axis and hwy on the y-axis, with both the points and the smooth lines colored by drv; se = FALSE removes the standard error bands.
# Question 3) show.legend = FALSE hides the legend box. It was used earlier in the chapter because, with three plots side by side, a legend on only the last plot would make that plot a different size and harder to compare with the others.
# Question 4) The se argument controls whether standard error bands are drawn around the lines
# Question 5) No. Both plots use the same data and mappings: the first sets them once in ggplot(), and the second repeats them in each geom.
# Question 6)
ggplot(data = mpg,mapping = aes(x = displ, y = hwy)) + geom_point()+ geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mpg, aes(x = displ, y = hwy)) + geom_smooth(mapping = aes(group = drv), se = FALSE) + geom_point()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + geom_point() + geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(aes(colour = drv)) + geom_smooth(aes(linetype = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(size = 4, color = "white") +
geom_point(aes(colour = drv))
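For Question 5, one way to check that the two specifications draw the same thing is to compare the data each layer computes. This sketch uses layer_data() from ggplot2; p_global and p_local are names chosen here for illustration:

```r
library(ggplot2)

# Data and mappings set once in ggplot()
p_global <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# Data and mappings repeated in each geom
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Compare the computed data of the point layer (1) and the smooth layer (2)
same_points <- all.equal(layer_data(p_global, 1), layer_data(p_local, 1))
same_smooth <- suppressMessages(
  all.equal(layer_data(p_global, 2), layer_data(p_local, 2))
)
same_points
same_smooth
```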
##3.7: Statistical transformations
Solution:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
# Overriding data to counts
demo <- tribble(
~cut, ~freq,
"Fair", 1610,
"Good", 4906,
"Very Good", 12082,
"Premium", 13791,
"Ideal", 21551
)
ggplot(data = demo) +
geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
# Proportion
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = stat(prop), group = 1))
# Stat Summary
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.ymin = min,
fun.ymax = max,
fun.y = median
)
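The summary that stat_summary() draws here can also be computed by hand. A sketch in base R, assuming the same three functions as above (min, median, max); depth_summary is a name chosen for illustration:

```r
library(ggplot2)  # provides the diamonds data set

# One vector of depth values per cut
by_cut <- split(diamonds$depth, diamonds$cut)

depth_summary <- data.frame(
  cut  = names(by_cut),
  ymin = vapply(by_cut, min, numeric(1)),
  y    = vapply(by_cut, median, numeric(1)),
  ymax = vapply(by_cut, max, numeric(1)),
  row.names = NULL
)
depth_summary  # one row per cut: lowest, median, and highest depth
```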
# *** Exercises 3.7.1 ***
# Question 1) Point range plot; geom_pointrange(...)
# Question 2) geom_col() has the default stat of stat_identity(); geom_bar() has the default stat of stat_count()
# Question 3) They usually share a name (for example, geom_boxplot() and stat_boxplot()), and each uses the other as its default.
# Question 4) y: predicted value; ymin: lower bound of the confidence interval; ymax: upper bound of the confidence interval; se: standard error
# Question 5) Without group = 1, geom_bar() computes the proportion within each value of x, so every bar has a height of 1
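Question 5 is easier to see with the numbers themselves. The sketch below computes by hand what group = 1 asks for (each cut's share of all diamonds) and what happens without it (each cut's share of itself):

```r
library(ggplot2)

counts <- table(diamonds$cut)

# With group = 1: each bar is the cut's share of all diamonds
prop_all <- counts / sum(counts)

# Without group = 1: each cut is its own group, so every share is 1
prop_within <- counts / counts

prop_all
prop_within
```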
##3.8: Position adjustments
Solution:
# Coloring in bar graphs
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
# Position adjustment
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar(alpha = 1/5, position = "identity")
ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) +
geom_bar(fill = NA, position = "identity")
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
# *** Exercises 3.8.1 ***
# Question 1) The plot suffers from overplotting: cty and hwy are rounded, so many points overlap. Adding position = "jitter" (or using geom_jitter()) reveals them
# Question 2) The width and height arguments
# Question 3) geom_jitter adds random variation to the locations of the points to reduce overplotting, at the cost of slightly changing the x and y values. geom_count keeps the locations and sizes the points relative to the number of observations, but it can create overlap of its own if points are close together and the sizes are large.
# Question 4) The default position adjustment for geom_boxplot() is "dodge2". A boxplot of the mpg data:
ggplot(data = mpg, aes(x = drv, y = hwy)) + geom_boxplot()
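The overplotting behind Questions 1 and 3 can be quantified, because many cars share the same cty and hwy values. This sketch counts the distinct points and then draws the two alternatives discussed above; the width and height values are arbitrary choices for illustration:

```r
library(ggplot2)

n_cars <- nrow(mpg)
n_distinct_points <- nrow(unique(mpg[, c("cty", "hwy")]))
c(cars = n_cars, distinct_points = n_distinct_points)

# Jitter spreads overlapping points apart with a little random noise
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3)

# geom_count() keeps exact positions and sizes points by how many overlap
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_count()
```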
##3.9: Coordinate systems
Solution:
# coord_flip
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
#coord_quickmap
install.packages("maps")
## Warning: unable to access index for repository http://cran.us.r-project.org/src/contrib:
## cannot open URL 'http://cran.us.r-project.org/src/contrib/PACKAGES'
## Warning: package 'maps' is not available (for R version 3.5.1)
## Warning: unable to access index for repository http://cran.us.r-project.org/bin/macosx/el-capitan/contrib/3.5:
## cannot open URL 'http://cran.us.r-project.org/bin/macosx/el-capitan/contrib/3.5/PACKAGES'
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") + coord_quickmap()
# Polar coordinates
bar <- ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = cut),
show.legend = FALSE,
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_flip()
bar + coord_polar()
# *** Exercise 3.9.1 ***
# Question 1)
ggplot(mpg, aes(x = factor(1), fill = drv)) +
geom_bar(width = 1) +
coord_polar(theta = "y")
# Question 2) The labs function adds axis titles and plot titles.
# Question 3) coord_map() uses a map projection to project the three-dimensional Earth onto a two-dimensional plane; coord_quickmap() uses an approximate but faster projection
# Question 4) coord_fixed makes sure that the line produced by geom_abline is at a 45 degree angle
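The plot that Question 4 refers to can be reproduced as below. The plot follows the exercise in the book; the final lines are an extra check, added here, of how often highway mileage exceeds city mileage:

```r
library(ggplot2)

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +   # reference line with slope 1 and intercept 0
  coord_fixed()     # one unit on x is as long as one unit on y

# Share of cars that sit above the reference line
share_above <- mean(mpg$hwy > mpg$cty)
share_above
```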
##3.10: The layered grammar of graphics
Solution:
# No code
#Task Two: The UFO data
Read in the UFO data from canvas. Use ggplot2, and any other commands you know in R to answer the following question: “Is the distribution of UFO shapes similar in American states and Canadian provinces that share a border?”
You may find the text ggplot2: Elegant Graphics for Data Analysis to be helpful.
library(ggplot2)
library(grid)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
ufos <- read_csv("ufos_clean.csv")
## Parsed with column specification:
## cols(
## Year = col_double(),
## Month = col_double(),
## Day = col_double(),
## Hour = col_double(),
## Minute = col_double(),
## State = col_character(),
## Shape = col_character(),
## Duration_minutes = col_double()
## )
# Remove blank values for shape
ufos <- subset(ufos, !(Shape == "" ))
# Subset for UFOs seen in American States
ufosUSA <- subset(ufos, !(State %in% c("AB","BC","MB","NB","NF","NS","ON","QC","SK")))
head(ufosUSA)
## # A tibble: 6 x 8
## Year Month Day Hour Minute State Shape Duration_minutes
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 2017 4 19 23 29 CO Light 60
## 2 2017 4 19 0 40 AR Light 120
## 3 2017 4 18 14 30 NY Teardrop 240
## 4 2017 4 16 21 0 UT Circle 240
## 5 2017 4 16 20 0 IL Formation 120
## 6 2017 4 14 22 30 IN Light 60
# Subset for UFOs seen in Canadian Provinces that share a border
ufosCAUSA <- subset(ufos, State %in% c("BC","AB","SK","MB","ON","QC","NB"))
# Data Visualization
USA <- ggplot(data = ufosUSA) +
  geom_bar(mapping = aes(x = Shape, fill = Shape)) +
  ggtitle("American State UFO sighting") +
  theme(plot.title = element_text(size = 10)) +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    legend.title = element_blank(),
    plot.margin = unit(c(0.8, 0.8, 0.5, 0.8), "cm"),
    legend.key.size = unit(0.1, "cm")
  )
CanadaBorder <- ggplot(data = ufosCAUSA) +
  geom_bar(mapping = aes(x = Shape, fill = Shape)) +
  ggtitle("Canadian provinces bordering US States") +
  theme(plot.title = element_text(size = 10)) +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    legend.title = element_blank(),
    plot.margin = unit(c(0.8, 0.8, 0.5, 0.8), "cm"),
    legend.key.size = unit(0.1, "cm")
  )
margin = theme(plot.margin = unit(c(2,2,2,2), "cm"))
grid.arrange(USA,CanadaBorder)
# Here we can see that the shape distributions for the US states and the bordering Canadian provinces are very similar. This might mean either A) people in the US are weird, or B) the closer you get to the US, the closer the results get. The two plots have a very similar overall shape, and both show that the most common shape is Light.
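If the two groups have different numbers of sightings, comparing shares rather than raw counts can make the two distributions easier to compare. A minimal sketch follows; shape_share() is a hypothetical helper written here, and the toy vectors are made up only to show the call:

```r
# Hypothetical helper: share of sightings for each shape.
shape_share <- function(shapes) {
  prop.table(table(shapes))
}

# Toy vectors, made up only to show the call; they are NOT the real UFO data.
# With the lab data you would pass ufosUSA$Shape and ufosCAUSA$Shape instead.
toy_usa <- c("Light", "Light", "Circle", "Triangle")
toy_border <- c("Light", "Circle")

shape_share(toy_usa)
shape_share(toy_border)
```

Each result sums to 1, so the two groups are on the same scale regardless of how many sightings each contains.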